1 Introduction

In this report I will explore the red wine quality dataset, a dataset that contains 1,599 red wines with 12 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

  • fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

  • volatile acidity: the amount of acetic acid in wine, which at levels that are too high can lead to an unpleasant vinegar taste

  • citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

  • residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

  • chlorides: the amount of salt in the wine

  • free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

  • total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

  • density: the density of wine is close to that of water depending on the percent alcohol and sugar content

  • pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

  • sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

  • alcohol: the percent alcohol content of the wine

  • quality (score between 0 and 10)

2 Exploration

3 Univariate Plots Section

In this section, we will first observe the structure of the dataset. Then, for each variable of the dataset, we will plot a histogram to better comprehend its distribution and, when needed, a boxplot to better visualize its variability.

3.1 Structure

## [1] 1599   12
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The red wine quality dataset contains 1599 observations and 12 variables: 11 are numeric (based on physicochemical tests) and 1, quality, is read as an integer and treated as an ordered factor (based on sensory data).
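
As a minimal sketch of how such a structure check can be reproduced, the block below rebuilds the first ten rows of three columns from the output above and inspects them. The name wine_head and the choice of columns are only for illustration; on the full data the same calls would be applied to wine_df. It also shows one possible way to turn the integer quality score into an ordered factor.

```r
# Stand-in for the full data: the first ten rows of three columns, copied
# from the structure output above. The name wine_head is hypothetical.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(wine_head)  # number of rows and columns
str(wine_head)  # type and first values of each column

# quality is read as an integer; one way to treat it as an ordered factor
# is to declare the full 0-10 scale as its levels.
wine_head$quality <- factor(wine_head$quality, levels = 0:10, ordered = TRUE)
is.ordered(wine_head$quality)
```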

3.2 Quality

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

The score is supposed to be between 0 and 10 but we see that it falls only between 3 and 8. The distribution seems to be normally distributed, with a most common value of 5. More than 96% of the red wine samples have a quality of at least 5.
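
The share quoted above can be checked directly from the frequency table. This is a small sketch that retypes the counts by hand; the variable names are mine.

```r
# Counts per quality score, copied from the table above
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

total      <- sum(quality_counts)
at_least_5 <- sum(quality_counts[c("5", "6", "7", "8")])

# Percentage of wines with a quality of at least 5
share_at_least_5 <- 100 * at_least_5 / total
round(share_at_least_5, 2)

# Most common score (the mode)
mode_quality <- names(quality_counts)[which.max(quality_counts)]
mode_quality
```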

3.3 Fixed acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

While the values are distributed between 4.60 and 15.90, most of them are between 7.10 and 9.20.

Some values (>14.50) seem to be outliers, so we might want to adjust the axes.

The distribution is slightly right skewed so the median of 7.90 is a better measure of the center.
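
The outlier cut-offs in this report were chosen by eye from the plots. As a point of comparison only, here is a sketch of a common alternative convention, the 1.5 x IQR rule used for boxplot whiskers, applied to the fixed acidity quartiles above. This rule is an assumption I am adding for illustration, not the method used in this report, and it is stricter than the visual threshold of 14.50.

```r
# Quartiles of fixed acidity, copied from the summary above
q1 <- 7.10
q3 <- 9.20

# Interquartile range and the 1.5 x IQR fences
iqr         <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

c(lower = lower_fence, upper = upper_fence)
```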

3.4 Volatile acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

While the values are distributed between 0.1200 and 1.5800, most of them are between 0.3900 and 0.6400.

Some values (>1) seem to be outliers, so we might want to adjust the axes.

The distribution seemed slightly right skewed before; now it looks rather normal, with a median of 0.5200 approximately equal to the mean of 0.5278. We can see some peaks at 0.42 and 0.56.

3.5 Citric acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

While the values are distributed between 0 and 1, most of them are between 0.090 and 0.420.

One value (= 1) seems to be an outlier, so we might want to adjust the axes.

The distribution seems slightly right skewed so the median of 0.260 is a better measure of the center. We can see multiple peaks at 0, 0.25 and 0.47.

3.6 Residual sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

While the values are distributed between 0.900 and 15.500, most of them are between 1.900 and 2.600.

Some values (>6.9) seem to be outliers, so we might want to adjust the axes.

The distribution looks normal around the peak but is slightly right skewed so the median of 2.200 is a better measure of the center.

3.7 Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

While the values are distributed between 0.01200 and 0.61100, most of them are between 0.07000 and 0.09000.

Some values (> 0.3) seem to be outliers, so we might want to adjust the axes.

The distribution looks normal around the peak but is slightly right skewed so the median of 0.07900 is a better measure of the center.

3.8 Free sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

While the values are distributed between 1 and 72, most of them are between 7 and 21.

Some values (> 58) seem to be outliers, so we might want to adjust the axes.

The distribution is right skewed so the median of 14 is a better measure of the center.

3.9 Total sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

While the values are distributed between 6.00 and 289.00, most of them are between 22.00 and 62.00.

Some values (> 175) seem to be outliers, so we might want to adjust the axes.

The distribution is slightly right skewed so the median of 38.00 is a better measure of the center.

3.10 Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

While the values are distributed between 0.9901 and 1.0037, most of them are between 0.9956 and 0.9978.

The distribution seems normally distributed with a mean of 0.9967.

3.11 pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

While the values are distributed between 2.740 and 4.010, most of them are between 3.210 and 3.400.

The distribution seems normally distributed with a mean of 3.311.

3.12 Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

While the values are distributed between 0.3300 and 2.0000, most of them are between 0.5500 and 0.7300.

Some values (> 1.5) seem to be outliers, so we might want to adjust the axes.

The distribution is slightly right skewed so the median of 0.6200 is a better measure of the center.

3.13 Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

While the values are distributed between 8.40 and 14.90, most of them are between 9.50 and 11.10.

Some values (> 14) seem to be outliers, so we might want to adjust the axes.

The distribution is slightly right skewed so the median of 10.20 is a better measure of the center.

4 Univariate Analysis

What is the structure of your dataset?

The red wine quality dataset contains 1599 observations and 12 variables: 11 are numeric (based on physicochemical tests) and 1 is an ordered factor (based on sensory data).

What I found :

  • the mode for the quality is 5 and more than 96% of the red wine samples have a quality of at least 5.
  • the distribution of the fixed acidity is right skewed and the median is 7.90
  • the distribution of the volatile acidity is normally distributed and the mean is 0.5278
  • the distribution of the citric acid is right skewed, the median is 0.260 and 132 observations have a citric acid of 0.
  • the distribution of the residual sugar is right skewed and the median is 2.200
  • the distribution of the chlorides is slightly right skewed and the median is 0.07900
  • the distribution of the free sulfur dioxide is right skewed and the median is 14
  • the distribution of the total sulfur dioxide is right skewed and the median is 38
  • the distribution of the density is normally distributed and the mean is 0.9967
  • the distribution of the pH is normally distributed and the mean is 3.311
  • the distribution of the sulphates is right skewed and the median is 0.6200
  • the distribution of the alcohol percentage is right skewed and the median is 10.20
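
One quick way to back up these skewness statements is to compare the mean and the median of each variable: when the mean sits clearly above the median, the right tail is pulling it up. The sketch below retypes the values from the summaries in section 3; the relative_gap column is a rough indicator I am introducing here, not a formal skewness statistic.

```r
# Median and mean of each variable, copied from the summaries in section 3
center <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol"),
  median = c(7.90, 0.5200, 0.260, 2.200, 0.07900, 14.00,
             38.00, 0.9968, 3.310, 0.6200, 10.20),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87,
             46.47, 0.9967, 3.311, 0.6581, 10.42),
  stringsAsFactors = FALSE
)

# Gap between mean and median, relative to the median
center$relative_gap <- (center$mean - center$median) / center$median

# Largest gaps first: these are the most right skewed variables
center[order(-center$relative_gap), ]
```

With these values, density and pH have a gap very close to zero, which is consistent with describing them as roughly normal.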

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in our dataset is the quality. A good question to ask ourselves is which variables contribute to a high quality wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

So far I can’t really put aside any variables so I would say that all the other 11 variables can at this stage support my investigation into my feature of interest.

Did you create any new variables from existing variables in the dataset?

I didn’t create a new variable.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Some distributions caught my attention. First, the quality is supposed to be between 0 and 10 but the range of the dataset’s ratings is only from 3 to 8, suggesting that the extreme ratings are very rare or even impossible. Then, in the citric acid distribution there are 132 observations with a citric acid value of 0. Even though it is stated that it is found in small quantities, it is still 8% of the dataset without citric acid.

The dataset was already tidy and there were no missing values, so I did not have to perform any action during the exploration. However, some outliers are present in some distributions, so I have to take that into consideration for my further explorations.
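
The two checks mentioned above can be sketched as follows. The vector holds only the first ten citric acid values shown in section 3.1, so it is an illustration rather than the full column; on the full data frame the same idea would be colSums(is.na(wine_df)) for missing values.

```r
# First ten citric.acid values, copied from the structure output in section 3.1
citric_head <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Missing values: FALSE means the vector has no NA
has_missing <- anyNA(citric_head)
has_missing

# Number of wines in this small sample with no citric acid at all
n_zero_head <- sum(citric_head == 0)
n_zero_head

# Share of the full dataset with a citric acid value of 0,
# using the 132 observations out of 1599 reported above
zero_share <- 100 * 132 / 1599
round(zero_share, 1)
```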

5 Bivariate Plots Section

5.1 Correlation Matrix

Let’s look at a correlation matrix to try to understand the relationship between variables.

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000

For a correlation coefficient r, we define :

  • Strong relationship : 0.7 <= |r| <= 1.0
  • Moderate relationship : 0.3 <= |r| < 0.7
  • Weak relationship : 0.0 <= |r| < 0.3
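
These thresholds can be written as a small helper. This is only a sketch: classify_strength is a name I made up, and it uses cut() with left-closed intervals so that 0.3 counts as moderate and 0.7 as strong, as in the definitions above.

```r
# Hypothetical helper: label a correlation coefficient by its strength
classify_strength <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.3, 0.7, 1),
      labels = c("weak", "moderate", "strong"),
      right = FALSE,          # intervals are [0, 0.3), [0.3, 0.7), [0.7, 1]
      include.lowest = TRUE)  # with right = FALSE this keeps |r| = 1 in the last bin
}

# A few coefficients taken from the matrix above
r_values <- c(0.67170343, -0.68297819, 0.22637251, 0.47616632)
strength  <- as.character(classify_strength(r_values))
direction <- ifelse(r_values > 0, "positive", "negative")

data.frame(r = r_values, strength = strength, direction = direction)
```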

We assume that we are only interested in the relationships between all variables that are at least moderate, and for the relationships involving our main feature (quality) the ones that have a correlation coefficient of at least +/-0.2.

Following the previous statement, we notice the following relationships :

relationship                                  correlation coefficient   strength   direction
citric acid - fixed acidity                    0.67170343               moderate   positive
citric acid - volatile acidity                -0.552495685              moderate   negative
total sulfur dioxide - free sulfur dioxide     0.667666450              moderate   positive
density - fixed acidity                        0.66804729               moderate   positive
density - citric acid                          0.36494718               moderate   positive
density - residual sugar                       0.355283371              moderate   positive
pH - fixed acidity                            -0.68297819               moderate   negative
pH - citric acid                              -0.54190414               moderate   negative
sulphates - citric acid                        0.31277004               moderate   positive
sulphates - chlorides                          0.371260481              moderate   positive
alcohol - density                             -0.49617977               moderate   negative
quality - volatile acidity                    -0.390557780              moderate   negative
quality - alcohol                              0.47616632               moderate   positive
quality - citric acid                          0.22637251               weak       positive
quality - sulphates                            0.251397079              weak       positive
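
The selection rule stated above (at least moderate for any pair, or at least +/-0.2 when quality is involved) can be sketched in code. To keep the example short it uses only a 5 x 5 excerpt of the correlation matrix, retyped from the output above; the names r, pair_table and kept are mine, and on the full data r would simply be the complete matrix.

```r
# Excerpt of the correlation matrix for five variables (values from section 5.1)
vars <- c("volatile.acidity", "density", "sulphates", "alcohol", "quality")
r <- matrix(c(
   1.000000000,  0.02202623, -0.260986685, -0.20228803, -0.39055778,
   0.022026230,  1.00000000,  0.148506412, -0.49617977, -0.17491923,
  -0.260986685,  0.14850641,  1.000000000,  0.09359475,  0.25139708,
  -0.202288030, -0.49617977,  0.093594750,  1.00000000,  0.47616632,
  -0.390557780, -0.17491923,  0.251397080,  0.47616632,  1.00000000
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# One row per unordered pair of variables (upper triangle only)
idx <- which(upper.tri(r), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx],
  stringsAsFactors = FALSE
)

# Keep pairs that are at least moderate, or that involve quality with |r| >= 0.2
involves_quality <- pair_table$var1 == "quality" | pair_table$var2 == "quality"
keep <- abs(pair_table$r) >= 0.3 | (involves_quality & abs(pair_table$r) >= 0.2)
kept <- pair_table[keep, ]
kept
```

On this excerpt the rule keeps alcohol - density and the three quality pairs with volatile acidity, sulphates and alcohol, which matches the corresponding rows of the table.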

We will see these relationships in more detail in the following sections.

5.2 citric acid - fixed acidity

It appears that the more citric.acid there is, the more fixed.acidity there is. However, there are a lot of variations when the value of citric.acid increases.

5.3 citric acid - volatile acidity

It appears that the more citric.acid there is, the less volatile.acidity there is. However, there are still a lot of variations.

5.4 total sulfur dioxide - free sulfur dioxide

It appears that the more free.sulfur.dioxide there is, the more total.sulfur.dioxide there is. There is a peak around 37 of free.sulfur.dioxide.

5.5 density - fixed acidity

It appears that the higher the density, the higher the fixed.acidity. However, there are a lot of variations.

5.6 density - citric acid

It appears that the higher the density, the more citric.acid there is. However, there are a lot of variations.

5.7 density - residual sugar

The relationship looks quite weak. Even though there are some peaks, we can’t be sure there is a real relationship between these two variables.

5.8 pH - fixed acidity

It appears that the higher the pH, the lower the fixed.acidity. It is indeed logical as higher values of pH correspond to a more basic liquid.

5.9 pH - citric acid

It appears that the higher the pH, the less citric acid there is. It is indeed logical as higher values of pH correspond to a more basic liquid.

5.10 sulphates - citric acid

It appears that the more citric.acid there is, the more sulphates there are, especially after 0.75 of citric.acid where the amount of sulphates increases a lot.

5.11 sulphates - chlorides

The relationship between sulphates and chlorides is not quite clear. There are some peaks but it seems quite random.

5.12 alcohol - density

It appears that the more alcohol there is, the less density there is.

5.13 quality - volatile acidity

We can observe a trend right here: it seems that lower volatile.acidity means higher quality.

5.14 quality - alcohol

Apart from the value for the quality 5, we can observe a trend right here: it seems that higher alcohol means higher quality.
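
A numeric counterpart to this kind of plot is the median of alcohol within each quality score. The sketch below uses only the first ten rows shown in section 3.1, under the hypothetical name wine_head, so it illustrates the call and should not be read as the trend of the full dataset.

```r
# First ten rows of alcohol and quality, copied from the structure output
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Median alcohol percentage for each quality score present in the sample
median_alcohol <- tapply(wine_head$alcohol, wine_head$quality, median)
median_alcohol
```

On the full data, boxplot(alcohol ~ quality, data = wine_df) would draw the same comparison with base R graphics.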

5.15 quality - citric acid

We can observe a trend right here: it seems that higher citric.acid means higher quality.

5.16 quality - sulphates

We can also observe a trend right here: it seems that higher sulphates means higher quality.

6 Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I started my analysis by creating a correlation matrix in order to understand better the relationship between variables.

I narrowed the number of relationships that I was interested in by only keeping the relationships between all variables that are at least moderate, and for the relationships involving our main feature (quality) the ones that have a correlation coefficient of at least +/-0.2. I did that so I could focus only on the most predominant relationships.

Of the 15 relationships kept for exploration, 4 concerned the quality variable and I observed these trends :

  • less volatile acidity means higher quality : it makes sense as a high level of volatile acidity can lead to an unpleasant vinegar taste.
  • more alcohol means higher quality : the percentage of alcohol is probably responsible for the taste.
  • more citric acid means higher quality : it makes sense as citric acid can add freshness and flavor to wines.
  • more sulphates means higher quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Of the 15 relationships kept for exploration, 11 concerned the other features and I observed these trends :

  • more citric acid means more fixed acidity, more sulphates and less volatile acidity.
  • more free sulfur dioxide means more total sulfur dioxide.
  • more density means more fixed acidity and citric acid.
  • more pH means less fixed acidity and citric acid.
  • more alcohol means less density.

What was the strongest relationship you found?

Concerning the feature of interest, the strongest relationship that I found was between the quality and the alcohol with a correlation coefficient of 0.47616632, meaning it is a moderate positive relationship.

Concerning the other features, the strongest relationship that I found was between the pH and the fixed acidity with a correlation coefficient of -0.68297819, meaning it is a moderate (almost strong) negative relationship. It is indeed logical as higher values of pH correspond to a more basic liquid.

7 Multivariate Plots Section

First I will try to visualize relationships between the feature of interest and 2 other features, then I will try to visualize relationships between 3 other features.

# create a quality_rating variable that classifies the quality into 3 categories
wine_df$quality_rating <- ifelse(wine_df$quality < 5, 'Bad',
                          ifelse(wine_df$quality < 7, 'Average', 'Good'))
wine_df$quality_rating <- ordered(wine_df$quality_rating,
                                  levels = c('Bad', 'Average', 'Good'))

In this new section, I created a new variable quality_rating containing the rating of the wine (Bad, Average or Good) according to the quality so I could facet wrap any future visualization with that variable.
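
As a sketch of what this classification produces, the quality column can be rebuilt from the frequency table in section 3.2 and binned with cut(), which is an alternative way to write the same three categories as the nested ifelse above. The standalone vectors quality and quality_rating are used here only so the example runs without the data file.

```r
# Rebuild the quality column from the counts in section 3.2
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Same categories as the nested ifelse: below 5 is Bad, 5 or 6 is Average,
# 7 and above is Good. cut() uses right-closed intervals by default.
quality_rating <- cut(quality,
                      breaks = c(-Inf, 4, 6, Inf),
                      labels = c("Bad", "Average", "Good"),
                      ordered_result = TRUE)

# Number of wines in each category
rating_counts <- table(quality_rating)
rating_counts
```

The three groups are very unbalanced, with most wines falling in the Average category, which is worth keeping in mind when comparing facets.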

7.1 Quality features

7.1.1 Quality - volatile acidity - alcohol

High quality wines seem to have high alcohol and low volatile acidity.

7.1.2 Quality - volatile acidity - citric acid

High quality wines seem to have low volatile acidity and high citric acid.

7.1.3 Quality - volatile acidity - sulphates

High quality wines seem to have high sulphates and low volatile acidity.

7.1.4 Quality - alcohol - citric acid

We can see that high alcohol tends to go with high quality wines, but we can’t really say anything about the citric acid here.

7.1.5 Quality - alcohol - sulphates

High quality wines seem to have high alcohol and high sulphates.

7.1.6 Quality - citric acid - sulphates

We can see that high sulphates tend to go with high quality wines, but we can’t really say anything about the citric acid here.

7.1.7 Quality - volatile acidity - chlorides

High quality wines seem to have low chlorides and low volatile acidity.

7.1.8 Quality - volatile acidity - density

High quality wines seem to have low volatile acidity and low density (even though the relationship seems to be weak for the density).

7.1.9 Quality - alcohol - residual sugar

There doesn’t seem to be any meaningful relationship between the alcohol and residual sugar.

7.2 Other features

7.2.1 Citric acid - fixed acidity - volatile acidity

High fixed acidity and low volatile acidity tend to go with high citric acid.

7.2.2 Density - fixed acidity - citric acid

High citric acid and high fixed acidity tend to go with the highest density.

7.2.3 pH - fixed acidity - citric acid

Low citric acid and low fixed acidity tend to go with high pH.

8 Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The following combinations seem to contribute to a high quality wine:

  • high alcohol with low volatile acidity
  • high alcohol with high sulphates

These were the relationships that were the easiest to find.

Were there any interesting or surprising interactions between features?

The thing that surprised me the most is that there is no meaningful relationship between residual sugar and alcohol.


9 Final Plots and Summary

9.0.1 Plot One

9.0.2 Description One

The plot above is a bar plot showing the distribution of the quality (from 0 to 10) in the red wines dataset.

The quality is supposed to be between 0 and 10 but we see that it falls only between 3 and 8. Maybe no such things as really good wines or really bad wines exist, or maybe the dataset doesn’t have these wines. Additionally, more than 96% of the red wine samples have a quality of at least 5, meaning there are not a lot of bad wines in the dataset.

9.0.3 Plot Two

This plot shows a box plot of the alcohol percentage for each quality. This is a good way to represent the relationship between the alcohol and the quality, as it shows how the center and the variability of the alcohol percentage change for each quality.

We notice that apart from the boxplot for quality 5, a trend seems to be emerging: it seems that a higher percentage of alcohol goes with a higher quality.

9.0.4 Description Two

Apart from the value for the quality 5, we can observe a trend right here: it seems that a higher alcohol percentage is associated with higher quality.

9.0.5 Plot Three

9.0.6 Description Three

This scatterplot shows the relationship between the alcohol percentage and the volatile acidity, while showing at the same time the quality of each observation.

We notice two clusters of points :

  • a cluster of low quality wines (red-orange) at the top left
  • a cluster of high quality wines (blue) at the bottom right.

We can then state that high quality red wines tend to have high alcohol percentage (as seen previously) but also low volatile acidity.

This observation is not surprising, as too high a level of volatile acidity can lead to an unpleasant vinegar taste and therefore to a lower quality.


10 Reflection

This project was interesting because it allowed me to put into practice the different steps of Exploratory Data Analysis with a powerful language like R.

The dataset I worked on is the red wine quality dataset. This dataset contains 1599 observations and 12 variables: 11 are numeric (based on physicochemical tests) and 1 is an ordered factor (based on sensory data).

First of all, I did a univariate exploration. I observed the structure of the dataset by displaying its dimensions and the types of its variables. Then, for each of the variables in the dataset, I displayed a summary and its histogram to get an overview of its distribution. This allowed me to know how it was distributed (right skewed, left skewed or normal) and if there were any outliers. It also allowed me to strengthen my understanding of the dataset. After this exploration, I chose to focus mainly on the quality feature, and I asked myself which variables contribute to a high quality wine.

After that, I did a bivariate exploration. I started my analysis by creating a correlation matrix in order to better understand the relationships between the variables. I narrowed the number of relationships that I was interested in by only keeping the relationships between all variables that are at least moderate, and for the relationships involving the main feature (quality) the ones that have a correlation coefficient of at least +/-0.2. I did that so I could focus only on the most predominant relationships. Concerning the main feature, I ended up with 4 relationships: less volatile acidity means higher quality, more alcohol means higher quality, more citric acid means higher quality and more sulphates means higher quality.

In the final part of the EDA I did a multivariate exploration. Since there were many variables to consider and many variable associations that could be made, I first decided to focus on the relationships involving the variable of interest (quality) and 2 other variables. I could for instance understand that high quality wines seem to have high alcohol, low volatile acidity (which is logical, as it is responsible for a vinegar taste at high levels) and high sulphates. Then, I focused on a few sets of 3 other variables that showed correlations in the bivariate exploration. I could for instance understand that low citric acid and low fixed acidity tend to go with high pH, which is logical as higher values of pH correspond to more basic (less acidic) solutions.

During this project, I encountered difficulties mainly when interpreting the plots of the multivariate exploration. Indeed, when a third variable is added, the plot sometimes immediately becomes less clear and the relationships much less obvious to determine. To counter this problem, I created a new variable quality_rating which classifies the quality variable into 3 categories: “bad”, “average” and “good”. I then plotted the variables again but made a facet_wrap with this new quality_rating variable. This allowed me to focus only on wines with a good quality_rating, and it helped me to discover some trends.

Among the successes of this project, I was especially surprised at how much the correlation matrix helped me in this exploration. It allowed me to guide my analysis and discover patterns and trends. This is definitely something I will rely on in my future analyses.

In the future, the analysis could be enriched by combining the red wine quality dataset with the white wine quality dataset. It might be interesting to determine the commonalities and differences between these two datasets, and this could also allow us to discover new insights about the 2 datasets.